snapshots refactoring #1970

battlmonstr · 2024-04-16T16:23:28Z

Previously the Snapshot subclasses were responsible for everything:

snapshot serialization formats
related indexes management
data access layer

Most of the logic was copy-pasted 3 times with minor variations.

Now the Snapshot logic is reorganized to decouple the storage engine from the Ethereum domain objects.
The copy-paste is reduced.
The same design structure can be applied to support more Snapshot/Index types in the future.

The Repository data model is changed to contain a continuous sequence of full snapshot bundles (see the Repository change at the end of the list below).

Changes:

SnapshotWordDeserializer represents the snapshot word format. Multiple deserializers can be used for the same .seg (e.g. to get words as raw RLP).
Snapshot represents a .seg file read-only view and provides iteration over entries with a custom SnapshotWordDeserializer.
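To illustrate how one file can be read through several deserializers, here is a minimal sketch. It is not the Silkworm code: ToySnapshot, RawWordDeserializer, NumberWordDeserializer, decode_word and read_all are hypothetical names, and the "file" is just an in-memory list of words.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a word stored in a .seg file.
using Word = std::string;

// Hypothetical read-only view: it stores raw words and knows nothing about their meaning.
class ToySnapshot {
  public:
    explicit ToySnapshot(std::vector<Word> words) : words_(std::move(words)) {}

    // Decode every word with the deserializer type chosen by the caller.
    template <class Deserializer>
    std::vector<typename Deserializer::Value> read_all() const {
        std::vector<typename Deserializer::Value> result;
        result.reserve(words_.size());
        for (const Word& word : words_) {
            Deserializer deserializer;
            deserializer.decode_word(word);
            result.push_back(std::move(deserializer.value));
        }
        return result;
    }

  private:
    std::vector<Word> words_;
};

// Returns the word untouched (comparable to reading words as raw RLP).
struct RawWordDeserializer {
    using Value = Word;
    Value value;
    void decode_word(const Word& word) { value = word; }
};

// Interprets the word as a big-endian unsigned integer.
struct NumberWordDeserializer {
    using Value = std::uint64_t;
    Value value{0};
    void decode_word(const Word& word) {
        assert(word.size() <= sizeof(Value));
        value = 0;
        for (unsigned char byte : word) {
            value = (value << 8) | byte;
        }
    }
};
```

Calling read_all<RawWordDeserializer>() and read_all<NumberWordDeserializer>() on the same ToySnapshot yields the same words in two representations, which is the kind of decoupling the two items above describe.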

Example to open a snapshot:

Snapshot snapshot{path};
snapshot.reopen_segment();

Index is decoupled from Snapshot. It is a wrapper of RecSplitIndex.

Example to open an index:

Index index{path.index_file()};
index.reopen_index();

Data access layer is decoupled from Snapshot and consists of snapshot readers and query objects.

A snapshot reader is a type-safe iterator over Snapshot words. A reader is bound to a particular SnapshotWordDeserializer. Readers don't depend on indexes. They assume that the snapshot has already been opened for reading.

Example to iterate over headers:

for (const BlockHeader& header : HeaderSnapshotReader{snapshot}) { ... }
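The shape of such a reader can be sketched as follows. This is an illustration under assumptions, not the actual implementation: ToySnapshot, ToyHeader, ToyHeaderDeserializer and ToySnapshotReader are hypothetical, and words are decimal strings held in memory rather than entries of a .seg file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for an already opened snapshot.
struct ToySnapshot {
    std::vector<std::string> words;
};

struct ToyHeader {
    std::uint64_t number{0};
};

// Hypothetical deserializer: each word is a decimal block number.
struct ToyHeaderDeserializer {
    using Value = ToyHeader;
    static Value decode(const std::string& word) {
        return Value{static_cast<std::uint64_t>(std::stoull(word))};
    }
};

// A reader binds one deserializer type to a snapshot and exposes begin()/end().
template <class Deserializer>
class ToySnapshotReader {
  public:
    using Value = typename Deserializer::Value;

    class Iterator {
      public:
        using iterator_category = std::input_iterator_tag;
        using value_type = Value;
        using difference_type = std::ptrdiff_t;
        using pointer = const Value*;
        using reference = const Value&;

        Iterator(const ToySnapshot* snapshot, std::size_t pos) : snapshot_(snapshot), pos_(pos) {
            assert(snapshot_ != nullptr);
            decode_current();
        }

        reference operator*() const { return current_; }
        pointer operator->() const { return &current_; }
        Iterator& operator++() {
            ++pos_;
            decode_current();
            return *this;
        }
        bool operator==(const Iterator& other) const { return pos_ == other.pos_; }
        bool operator!=(const Iterator& other) const { return pos_ != other.pos_; }

      private:
        // Decode the word at pos_ unless the iterator is at the end.
        void decode_current() {
            if (pos_ < snapshot_->words.size()) {
                current_ = Deserializer::decode(snapshot_->words[pos_]);
            }
        }

        const ToySnapshot* snapshot_;
        std::size_t pos_;
        Value current_{};
    };

    explicit ToySnapshotReader(const ToySnapshot& snapshot) : snapshot_(&snapshot) {}
    Iterator begin() const { return Iterator{snapshot_, 0}; }
    Iterator end() const { return Iterator{snapshot_, snapshot_->words.size()}; }

  private:
    const ToySnapshot* snapshot_;
};

using ToyHeaderReader = ToySnapshotReader<ToyHeaderDeserializer>;

// The loop variable is already a typed ToyHeader; no index is involved.
inline std::uint64_t sum_block_numbers(const ToySnapshot& snapshot) {
    std::uint64_t sum = 0;
    for (const ToyHeader& header : ToyHeaderReader{snapshot}) {
        sum += header.number;
    }
    return sum;
}
```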

Each query is a separate struct. Queries usually depend on indexes. When they do, they assume that the index has already been opened for reading.

Common queries are implemented in basic_queries.hpp: FindByIdQuery, FindByHashQuery, RangeFromIdQuery. They are then bound to a particular reader to produce results:

struct HeaderFindByHashQuery : public FindByHashQuery<HeaderSnapshotReader> {};

Example to find a header by hash:

std::optional<BlockHeader> header_opt = HeaderFindByHashQuery{snapshot, index}.exec(hash);
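How a generic query can be bound to a reader may be easier to see in a small, self-contained sketch. Everything here is hypothetical (ToySnapshot, ToyIndex, ToyHeaderReader, ToyFindByHashQuery, the "hash:number" word format); the index is a plain hash map rather than RecSplit. Re-checking the decoded hash is a design choice of this sketch, motivated by the fact that a perfect-hash-style index may return a position for a key it never saw.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an already opened snapshot.
struct ToySnapshot {
    std::vector<std::string> words;
};

// Hypothetical index: maps a key to the position of its word in the snapshot.
struct ToyIndex {
    std::unordered_map<std::string, std::size_t> positions;

    std::optional<std::size_t> lookup(const std::string& key) const {
        const auto it = positions.find(key);
        if (it == positions.end()) return std::nullopt;
        return it->second;
    }
};

struct ToyHeader {
    std::string hash;
    std::uint64_t number{0};
};

// Hypothetical reader: decodes the word at a position; the word format is "hash:number".
struct ToyHeaderReader {
    using Value = ToyHeader;
    const ToySnapshot& snapshot;

    std::optional<Value> read_at(std::size_t pos) const {
        if (pos >= snapshot.words.size()) return std::nullopt;
        const std::string& word = snapshot.words[pos];
        const std::size_t sep = word.find(':');
        if (sep == std::string::npos) return std::nullopt;
        return Value{word.substr(0, sep), static_cast<std::uint64_t>(std::stoull(word.substr(sep + 1)))};
    }
};

// Generic query: the lookup logic is written once and works with any reader
// that exposes Value and read_at().
template <class Reader>
struct ToyFindByHashQuery {
    const ToySnapshot& snapshot;
    const ToyIndex& index;

    std::optional<typename Reader::Value> exec(const std::string& hash) const {
        assert(!hash.empty());
        const auto pos = index.lookup(hash);
        if (!pos) return std::nullopt;
        auto value = Reader{snapshot}.read_at(*pos);
        // Reject a decoded value whose hash does not match the requested one.
        if (value && value->hash != hash) return std::nullopt;
        return value;
    }
};

// Binding the generic query to a concrete reader.
using ToyHeaderFindByHashQuery = ToyFindByHashQuery<ToyHeaderReader>;
```

With this shape, a body or transaction query would only need its own reader type; the lookup code stays shared.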

Some queries have custom logic, for example: BodyTxsAmountQuery.

Since the indexes are decoupled, they have to be stored separately in the Repository. To keep the complexity manageable, I've simplified the Repository to a different model. Previously the repository had snapshots organized by type, then by path, and each snapshot contained its related indexes. Now it is a sequence of full bundles ordered by block_from, sequential and continuous. A full bundle must have all snapshot and index types for a given block range. This means that the Repository doesn't handle "partial bundles" anymore and is stricter about not having gaps. SnapshotBundle is like a database schema:

struct SnapshotBundle {
    Snapshot header_snapshot;
    Index idx_header_hash;

    Snapshot body_snapshot;
    Index idx_body_number;

    Snapshot txn_snapshot;
    Index idx_txn_hash;
    Index idx_txn_hash_2_block;
};
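The "sequential and continuous" rule can be sketched with a toy repository. This is only an illustration: ToyBundle and ToyRepository are hypothetical, a bundle is reduced to its block range, and the sketch assumes half-open ranges [block_from, block_to) with the first bundle starting at block 0.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>

using BlockNum = std::uint64_t;

// Hypothetical: only the block range of a full bundle, [block_from, block_to).
struct ToyBundle {
    BlockNum block_from{0};
    BlockNum block_to{0};
};

class ToyRepository {
  public:
    // Accepts a bundle only if it extends the sequence without a gap or an overlap.
    bool add_bundle(ToyBundle bundle) {
        if (bundle.block_from >= bundle.block_to) return false;
        if (bundle.block_from != next_block()) return false;
        bundles_.emplace(bundle.block_from, bundle);
        assert(next_block() == bundle.block_to);
        return true;
    }

    // First block number not covered by any bundle yet.
    BlockNum next_block() const {
        return bundles_.empty() ? 0 : bundles_.rbegin()->second.block_to;
    }

    // Finds the bundle whose range contains the given block number.
    std::optional<ToyBundle> find_bundle(BlockNum number) const {
        auto it = bundles_.upper_bound(number);
        if (it == bundles_.begin()) return std::nullopt;
        --it;
        if (number >= it->second.block_to) return std::nullopt;
        return it->second;
    }

  private:
    // Keyed by block_from, so iteration order is the block order.
    std::map<BlockNum, ToyBundle> bundles_;
};
```

Because gaps are rejected on insertion, a lookup by block number needs no per-type bookkeeping: one ordered search locates the bundle, and the bundle holds every snapshot and index for that range.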

battlmonstr · 2024-04-25T14:53:01Z

Always setting read_senders=true has no tangible effect on performance.
This test on a 40 GB transactions file shows it (CMAKE_BUILD_TYPE=RelWithDebInfo):

TEST_CASE("tt") {
    Snapshot snapshot{*SnapshotPath::parse("/Volumes/WD4K/mainnet/snapshots/v1-017500-018000-transactions.seg")};
    snapshot.reopen_segment();
    TransactionSnapshotReader reader{snapshot};
    for ([[maybe_unused]] auto& t : reader) {}
}

with tx.set_sender() in decode_word_into_tx():

run 1: 3:10
run 2: 3:06
run 3: 3:10

without tx.set_sender()/senders_data in decode_word_into_tx():

run 1: 3:06
run 2: 3:05
run 3: 3:11
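For a rough comparison, the averages can be worked out as below. The sketch assumes the timings above are in minutes:seconds; the helper names are hypothetical. Under that assumption the means are about 188.7 s with set_sender() and about 187.3 s without, a difference of roughly 1.3 s (under 1%), which is smaller than the spread between individual runs.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Converts a minutes:seconds timing into seconds (assumed reading of the run times).
constexpr int to_seconds(int minutes, int seconds) { return minutes * 60 + seconds; }

// Arithmetic mean of the run times, in seconds.
inline double mean_seconds(const std::vector<int>& runs) {
    assert(!runs.empty());
    return std::accumulate(runs.begin(), runs.end(), 0.0) / static_cast<double>(runs.size());
}

// Runs with tx.set_sender(): 3:10, 3:06, 3:10.
inline double with_sender_mean() {
    return mean_seconds({to_seconds(3, 10), to_seconds(3, 6), to_seconds(3, 10)});
}

// Runs without tx.set_sender()/senders_data: 3:06, 3:05, 3:11.
inline double without_sender_mean() {
    return mean_seconds({to_seconds(3, 6), to_seconds(3, 5), to_seconds(3, 11)});
}
```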

battlmonstr · 2024-04-28T12:18:42Z

@canepat I was able to execute pre-downloaded snapshots from scratch to the last block I have in snapshots (18.8M - 1) locally on macOS. It took about 4 days.

  INFO [04-28|14:42:08.162 UTC] [12/12 Finish]                           op=Forward from=0 to=18799999 span=18799999 
  INFO [04-28|14:42:08.163 UTC] ExecutionPipeline                        Forward done
  INFO [04-28|14:42:08.257 UTC] PoSSync: Waiting for blocks... from=18'799'999

canepat · 2024-04-29T21:27:29Z

Great work!

battlmonstr force-pushed the pr/snap_ref2 branch 12 times, most recently from e7edb0f to 01c1d2a Compare April 19, 2024 13:59

battlmonstr requested a review from canepat April 19, 2024 15:20

battlmonstr marked this pull request as ready for review April 19, 2024 15:21

battlmonstr added the snapshots (Framework for BitTorrent-based snapshots) and maintenance (Some maintenance work (fix, refactor, rename, test...)) labels Apr 19, 2024

battlmonstr force-pushed the pr/snap_ref2 branch from 01c1d2a to 73ca1ad Compare April 22, 2024 13:10

battlmonstr added 13 commits April 24, 2024 15:13

extract header serialization

51b9e41

extract body serialization

0972bb7

refactor slice_tx_payload

5ac4cb0

rename txn_hash -> txn_snapshot_word_serializer

361e887

extract tx serialization

fe4644b

snapshot iterator using SnapshotWordSerializer

55094c3

move Snapshot class to snapshot_base.hpp

dacc3d8

refactor SnapshotWordSerializer-s: rename value, make check_sanity_with_metadata optional

1b44c19

HeaderSnapshotReader

c465358

rename snapshot_base -> snapshot_reader

a6bb467

BodySnapshotReader

41ff53f

refactor next_item/seek, TransactionSnapshotReader

73f7760

walkers using references

8caa0d5

battlmonstr added 4 commits April 24, 2024 15:13

refactor query call sites to use query objects

acc047a

remove path copies from SnapshotBundle

52f6858

refactor to separate Index-es from snapshots

3733503

rename serializer -> deserializer

aa3b101

battlmonstr force-pushed the pr/snap_ref2 branch from 73ca1ad to cf43085 Compare April 24, 2024 13:13

battlmonstr added 12 commits April 24, 2024 16:34

SnapshotReader: mutable iterators and fix move support

f61a88f

static_assert iterators

f9aa2ec

use ptrdiff_t difference_type in iterators

simplify queries and readers with typedefs

22e8ae5

concepts for deserializers and readers

d54cf6e

remove read_senders from snapshots access_layer

401cef8

rename ordinal_lookup -> lookup_by_ordinal

95ff21d

lookup_by_data_id: return optional, check upper bound

1226e02

simplify SnapshotSync::download_and_index_snapshots

e7a9c79

remove SnapshotRepository::view_segment

668937e

move stale index removal logic to repository and use it from SnapshotSync

f8bb9ef

reopen_index: throw if path is not found

0eaf8cd

make indexes mandatory in reopen

c553693

battlmonstr force-pushed the pr/snap_ref2 branch 3 times, most recently from 35c156e to 0b449ef Compare April 24, 2024 16:33

simplify count/max computations

d09764a

battlmonstr force-pushed the pr/snap_ref2 branch from 0b449ef to d09764a Compare April 25, 2024 08:50

battlmonstr added 2 commits April 25, 2024 10:53

demote log::Info to Debug: ETL collector flushed file

7d1b167

rebuild a missing transactions_to_block index if transactions index exists

14c4ef9

canepat approved these changes Apr 29, 2024

canepat merged commit 20336c9 into master Apr 29, 2024
5 checks passed

canepat deleted the pr/snap_ref2 branch April 29, 2024 21:28
